[HUDI-5857] Insert overwrite into bucket table would generate new file group id by beyond1920 · Pull Request #8072 · apache/hudi

beyond1920 · 2023-02-28T04:06:53Z

Change Logs

Snapshot query results are wrong after applying insert overwrite to an existing table with a simple bucket index.
See the details in HUDI-5857.
The root cause is that the write handler reuses the existing bucket file IDs for insert overwrite, while the insert overwrite operation uses a replace commit that marks all existing bucket file IDs as replaced.
As a result, the newly written data lands in file groups that are marked as replaced, so the snapshot query result is wrong.
This PR fixes the bug by generating new file IDs for buckets when insert overwriting into a bucket index table.
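
To make the failure mode concrete, here is a minimal sketch in Java. It is not Hudi code: `ReplaceCommitSketch`, `visibleFileGroups`, and `insertOverwrite` are hypothetical names, and file groups are modeled as plain strings. The only assumption taken from the description above is that a replace commit hides every file group ID that existed before the overwrite.

```java
import java.util.HashSet;
import java.util.Set;
import java.util.TreeSet;

// Toy model only; none of these names exist in Hudi.
final class ReplaceCommitSketch {

    // File groups a snapshot query can see: everything written so far
    // minus the ids that the replace commit marked as replaced.
    static Set<String> visibleFileGroups(Set<String> allFileGroups, Set<String> replaced) {
        Set<String> visible = new TreeSet<>(allFileGroups);
        visible.removeAll(replaced);
        return visible;
    }

    // Simulates one insert overwrite: all existing file groups are marked
    // as replaced, and the new data is written to writtenFileGroup.
    static Set<String> insertOverwrite(Set<String> existing, String writtenFileGroup) {
        Set<String> replaced = new HashSet<>(existing);
        Set<String> all = new HashSet<>(existing);
        all.add(writtenFileGroup);
        return visibleFileGroups(all, replaced);
    }

    public static void main(String[] args) {
        Set<String> existing = Set.of("bucket-0000");
        // Reusing the existing id: the new data is hidden by the replace commit.
        System.out.println(insertOverwrite(existing, "bucket-0000"));     // prints []
        // Writing to a fresh id: the new data stays visible.
        System.out.println(insertOverwrite(existing, "bucket-0000-new")); // prints [bucket-0000-new]
    }
}
```

In this model, reusing the ID leaves nothing visible, which matches the wrong snapshot result reported in the JIRA; a fresh ID is not in the replaced set and remains readable.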

Impact

NA

Risk level (write none, low medium or high below)

NA

Documentation Update

NA

Contributor's checklist

Read through contributor's guide
Change Logs and Impact were stated clearly
Adequate tests were added if applicable
CI passed

hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/spark/sql/hudi/TestInsertTable.scala

XuQianJin-Stars · 2023-02-28T05:30:39Z

Thanks @beyond1920, overall looks good.

KnightChess

How about removing tagLocation when using the insert overwrite operation, as in #8073? I think that would also solve this problem. And in the insert overwrite scenario I don't think tagLocation is needed, right?

KnightChess · 2023-02-28T05:56:37Z

.../main/java/org/apache/hudi/table/action/commit/SparkInsertOverwriteCommitActionExecutor.java

  protected Partitioner getPartitioner(WorkloadProfile profile) {
    return table.getStorageLayout().layoutPartitionerClass()
-        .map(c -> getLayoutPartitioner(profile, c))
+        .map(c -> c.equals(HoodieLayoutConfig.SIMPLE_BUCKET_LAYOUT_PARTITIONER_CLASS_NAME)


Won't consistentBucketIndex cause the same problem?

No, consistentBucketIndex works correctly; it generates different file IDs.

@KnightChess Thanks for your advice.
Removing tagLocation could also fix this problem. However, I prefer to fix it by generating new file IDs because:

Removing tag location would change the stats; for example, the updated count would be missing.

It's better to keep the same behavior for all index types instead of removing tag location in insert overwrite only for bucket index tables.
That said, removing tag location is a good improvement to speed up insert overwrite. I will create a new JIRA to track it. Maybe we could use bulk insert to do insert overwrite for all index types. WDYT?

@beyond1920 I read the consistentBucketIndex implementation and found that it must tag incoming records in order to allocate the fgId, so #8073 would cause some problems.

No, consistentBucketIndex works correctly; it generates different file IDs.

consistentBucketIndex does not work correctly; change the UT case.

Sorry for the hasty response earlier.
Thank you for pointing it out.
I need to spend more time getting familiar with ConsistentBucketIndex. I will respond ASAP.

XuQianJin-Stars · 2023-02-28T06:14:16Z

hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/spark/sql/hudi/TestInsertTable.scala

  }

-  test("Test Insert Overwrite") {
+  test("Test Insert Overwrite for bucket ") {


Please add a test for consistentBucketIndex.

ConsistentBucketIndex works correctly; it generates different file IDs.
However, I added test cases for ConsistentBucketIndex too.

KnightChess · 2023-02-28T15:20:06Z

hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/spark/sql/hudi/TestInsertTable.scala

+      // Insert overwrite static partition
+      spark.sql(
+        s"""
+           | insert overwrite table $tableName partition(dt = '2021-01-05')


This will create a new parquet file with the same prefix as the log file, but with a different fgId suffix. As in the picture, creating a new parquet file appends -0 to the fgId (xxx-0-0_xxx), so the data can be read if insert overwrite runs only once. But if insert overwrite runs again, it will use the same fgId (xxx-0-0), and the query returns nothing.

XuQianJin-Stars

LGTM

...di-client-common/src/main/java/org/apache/hudi/exception/HoodieInsertOverwriteException.java

.../main/java/org/apache/hudi/table/action/commit/SparkInsertOverwriteCommitActionExecutor.java

danny0405 · 2023-03-02T04:42:58Z

.../main/java/org/apache/hudi/table/action/commit/SparkInsertOverwriteCommitActionExecutor.java

+        case CONSISTENT_HASHING:
+          return new SparkInsertOverwriteConsistentBucketIndexPartitioner(profile, context, table, config);
+        default:
+          throw new HoodieNotSupportedException("Unknown bucket index engine type: " + config.getBucketIndexEngineType());


Can we inline all the different handling for getBucketInfo into SparkInsertOverwritePartitioner? Let's make the code cleaner.

I moved the part related to ConsistentBucketIndex into SparkInsertOverwritePartitioner,
and left the part related to SimpleBucketIndex in SparkBucketIndexInsertOverwritePartitioner,
because SimpleBucketIndex and ConsistentBucketIndex differ in how they create a new BucketInfo.

danny0405 · 2023-03-02T04:44:26Z

.../main/java/org/apache/hudi/table/action/commit/SparkInsertOverwriteCommitActionExecutor.java

+      return handleInsert(binfo.fileIdPrefix, recordItr);
+    } else if (btype.equals(BucketType.UPDATE)) {
+      throw new HoodieInsertOverwriteException(
+          "Insert overwrite should always use INSERT bucketType, please correct the logical of " + partitioner.getClass().getName());


In which case can we hit the code path for BucketType.UPDATE?

This is protective code to prevent hitting this bug again when a new partitioner class is introduced in the future.
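
The protective check discussed here can be sketched in isolation as follows. This is a simplified stand-in, not the code from the PR: `InsertOverwriteGuardSketch`, its local `BucketType` enum, and `InsertOverwriteException` are hypothetical substitutes for the Hudi types named in the diff, and `handlePartition` returns a string instead of writing records.

```java
// Simplified stand-in for the guard in the diff; all names are hypothetical.
final class InsertOverwriteGuardSketch {

    enum BucketType { INSERT, UPDATE }

    static final class InsertOverwriteException extends RuntimeException {
        InsertOverwriteException(String message) {
            super(message);
        }
    }

    // Insert overwrite must only ever create new file groups, so any bucket
    // type other than INSERT is treated as a bug in the partitioner and
    // fails fast instead of silently writing into a replaced file group.
    static String handlePartition(BucketType type, String fileIdPrefix, String partitionerName) {
        if (type == BucketType.INSERT) {
            return "insert:" + fileIdPrefix;
        }
        throw new InsertOverwriteException(
            "Insert overwrite should always use INSERT bucket type, check " + partitionerName);
    }

    public static void main(String[] args) {
        System.out.println(handlePartition(BucketType.INSERT, "f1", "DemoPartitioner"));
        try {
            handlePartition(BucketType.UPDATE, "f1", "DemoPartitioner");
        } catch (InsertOverwriteException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```

Failing loudly here trades a possible runtime exception for the guarantee that a future partitioner cannot reintroduce the silent data loss this PR fixes.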

...lient/src/main/java/org/apache/hudi/table/action/commit/SparkInsertOverwritePartitioner.java

danny0405 · 2023-03-09T06:31:38Z

...ain/java/org/apache/hudi/table/action/commit/SparkBucketIndexInsertOverwritePartitioner.java

+
+  @Override
+  public BucketInfo getBucketInfo(int bucketNumber) {
+    String partitionPath = partitionPaths.get(bucketNumber / numBuckets);


In HoodieWriteConfig, we can fetch the operation and decide whether it is INSERT_OVERWRITE; then the logic can be moved into SparkBucketIndexPartitioner.
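
As a rough illustration of this suggestion, the sketch below shows how a single partitioner could branch on the operation type. It is an assumption-laden model, not the merged implementation: `BucketInfoSketch` and all of its methods are hypothetical, the `insertOverwrite` flag stands in for whatever the write config reports, and the file ID scheme (8-digit bucket ID plus a random suffix) is only loosely modeled on bucket file naming. The index arithmetic mirrors the `bucketNumber / numBuckets` line in the diff above.

```java
import java.util.List;
import java.util.UUID;

// Hypothetical model of a bucket partitioner that is aware of insert overwrite.
final class BucketInfoSketch {

    // Each table partition owns numBuckets consecutive Spark partition indexes.
    static String partitionPathFor(int bucketNumber, int numBuckets, List<String> partitionPaths) {
        return partitionPaths.get(bucketNumber / numBuckets);
    }

    // The position within that run of indexes is the bucket id.
    static int bucketIdFor(int bucketNumber, int numBuckets) {
        return bucketNumber % numBuckets;
    }

    // Assumed id scheme: zero-padded bucket id followed by a random suffix.
    static String newFileIdPrefix(int bucketId) {
        return String.format("%08d", bucketId) + UUID.randomUUID().toString().substring(8);
    }

    // For insert overwrite, never reuse an existing file id, even if the
    // bucket already has one; otherwise keep updating the existing file group.
    static String fileIdPrefixFor(boolean insertOverwrite, String existingPrefix, int bucketId) {
        if (insertOverwrite || existingPrefix == null) {
            return newFileIdPrefix(bucketId);
        }
        return existingPrefix;
    }

    public static void main(String[] args) {
        List<String> paths = List.of("dt=2021-01-05", "dt=2021-01-06");
        int numBuckets = 4;
        int bucketNumber = 5;
        System.out.println(partitionPathFor(bucketNumber, numBuckets, paths)); // prints dt=2021-01-06
        System.out.println(bucketIdFor(bucketNumber, numBuckets));             // prints 1
        System.out.println(fileIdPrefixFor(true, "00000001-existing", 1));     // fresh id starting with 00000001-
    }
}
```

Keeping the branch inside one partitioner avoids a separate insert overwrite subclass, which appears to be the cleanup the reviewer is asking for.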

danny0405

+1, we are good to land once the CI is green

beyond1920 · 2023-03-09T12:35:25Z

@hudi-bot run azure

hudi-bot · 2023-03-09T17:03:04Z

CI report:

7d6319d Azure: SUCCESS

Bot commands

@hudi-bot supports the following commands:

@hudi-bot run azure: re-run the last Azure build

XuQianJin-Stars reviewed Feb 28, 2023

KnightChess mentioned this pull request Feb 28, 2023

[HUDI-5861] fix table can not read data after overwrite table with bu… #8073

Closed

voonhous mentioned this pull request Feb 28, 2023

[HUDI-5529] Ensure that consistent hashing metadata is purged when dropping a partition #7645

Closed

KnightChess reviewed Feb 28, 2023

XuQianJin-Stars reviewed Feb 28, 2023

beyond1920 force-pushed the fixInsertOverwriteIntoBucketTable branch from 9a4747e to 5fa9c60 Compare February 28, 2023 10:19

KnightChess reviewed Feb 28, 2023

beyond1920 force-pushed the fixInsertOverwriteIntoBucketTable branch from 5fa9c60 to d4e9858 Compare March 1, 2023 06:31

XuQianJin-Stars approved these changes Mar 1, 2023

beyond1920 force-pushed the fixInsertOverwriteIntoBucketTable branch from d4e9858 to d0099b9 Compare March 1, 2023 11:06

danny0405 changed the title ~~[HUDI-5857] Index overwrite into bucket table would generate new file group id~~ [HUDI-5857] Insert overwrite into bucket table would generate new file group id Mar 2, 2023

danny0405 reviewed Mar 2, 2023

danny0405 reviewed Mar 2, 2023

danny0405 reviewed Mar 2, 2023

beyond1920 force-pushed the fixInsertOverwriteIntoBucketTable branch from d0099b9 to a76dc55 Compare March 3, 2023 08:13

danny0405 reviewed Mar 8, 2023

danny0405 reviewed Mar 9, 2023

beyond1920 added 4 commits March 9, 2023 17:33

[HUDI-5857] Index overwrite into bucket table would generate new file…

4a05f53

… group id

Add testcases

5f16a44

Fix bug insert overwrite into consistent hash index table

63f3100

Update based on Danny's comment

7952831

beyond1920 force-pushed the fixInsertOverwriteIntoBucketTable branch 2 times, most recently from 2015eb0 to 7267340 Compare March 9, 2023 09:38

danny0405 approved these changes Mar 9, 2023

Overwrite getBucketInfo of SparkBucketIndexPartitioner to handle inse…

f550344

…rt overwrite behavior

beyond1920 force-pushed the fixInsertOverwriteIntoBucketTable branch from 7267340 to f550344 Compare March 9, 2023 09:48

fix failure cases

7d6319d

XuQianJin-Stars merged commit 51d0351 into apache:master Mar 10, 2023

nsivabalan pushed a commit to nsivabalan/hudi that referenced this pull request Mar 18, 2023

[HUDI-5857] Insert overwrite into bucket table would generate new fil…

c0f0e50

…e group id (apache#8072)

nsivabalan pushed a commit to nsivabalan/hudi that referenced this pull request Mar 22, 2023

[HUDI-5857] Insert overwrite into bucket table would generate new fil…

ef96ed8

…e group id (apache#8072)

southernriver pushed a commit to southernriver/hudi that referenced this pull request Mar 31, 2023

[HUDI-5857] Insert overwrite into bucket table would generate new fil…

ef6b63b

…e group id (apache#8072)

fengjian428 pushed a commit to fengjian428/hudi that referenced this pull request Apr 5, 2023

[HUDI-5857] Insert overwrite into bucket table would generate new fil…

cbba7fe

…e group id (apache#8072)

stayrascal pushed a commit to stayrascal/hudi that referenced this pull request Apr 20, 2023

[HUDI-5857] Insert overwrite into bucket table would generate new fil…

4ddcb68

…e group id (apache#8072)

KnightChess pushed a commit to KnightChess/hudi that referenced this pull request Jan 2, 2024

[HUDI-5857] Insert overwrite into bucket table would generate new fil…

e5f5800

…e group id (apache#8072)
